White Wine Dataset Report, by Anthony Munnelly

Univariate Plots Section

First, I plotted histograms of all the variables, to see what their individual distributions were like.

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Fixed acidity is roughly symmetric around its center - the mean (6.855) and median (6.800) are almost the same - although the maximum of 14.2 shows that there is a long upper tail.

Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Volatile acidity is considerably lower than fixed acidity, and is slightly right-skewed in its distribution.

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid is again lower than fixed acidity, with a slight right-skew to the distribution.

Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Residual sugar is very right-skewed, and has an outlier that’s a long way out. Let’s look at a boxplot of residual sugar, as it’s easier to see outliers on boxplots.

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Again, a strong right-skew. We’ll look at another boxplot.

A remarkable number of outliers.
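As a rough check on what the boxplots show, the sketch below applies the usual 1.5 x IQR rule to the quartiles printed above for residual sugar and chlorides. It assumes the boxplots were drawn with that conventional whisker length, which the report does not state; the `upper_fence` helper and the `summaries` dictionary are names invented for this illustration.

```python
# Quartiles and maxima copied from the summaries above: (Q1, Q3, max).
summaries = {
    "residual.sugar": (1.7, 9.9, 65.8),
    "chlorides": (0.036, 0.050, 0.346),
}


def upper_fence(q1, q3, k=1.5):
    """Upper Tukey fence; points above it are usually drawn as outliers.

    k=1.5 is the conventional multiplier and is an assumption here.
    """
    return q3 + k * (q3 - q1)


for name, (q1, q3, maximum) in summaries.items():
    fence = upper_fence(q1, q3)
    print(f"{name}: upper fence {fence:.3f}, maximum {maximum} "
          f"({maximum / fence:.1f} times the fence)")
```

Under that assumption the fences sit at 22.2 for residual sugar and 0.071 for chlorides, so the maxima of 65.8 and 0.346 both lie well beyond them.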

Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Again, quite a number of outliers.

Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

More outliers.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

And yet more outliers.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

This histogram looks a little bunched. We’ll look at this a little more closely.

A nearly-normal distribution for the majority of the data.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Another nearly-normal distribution.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

A completely irregular pattern for alcohol distribution.

Univariate Analysis

What is the structure of your dataset?

This is the structure of the data,

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...

and these are summaries of each variable in the data set.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol      quality 
##  Min.   : 8.00   3:  20  
##  1st Qu.: 9.50   4: 163  
##  Median :10.40   5:1457  
##  Mean   :10.51   6:2198  
##  3rd Qu.:11.40   7: 880  
##  Max.   :14.20   8: 175  
##                  9:   5

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the white wine dataset are quality and alcohol. These are the two most likely reasons why people buy and drink wine in the first place.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Theory suggests that sulphates and residual sugar have strong influences on the quality and alcohol content of a wine. This theory is investigated below.

Did you create any new variables from existing variables in the dataset?

I did not create any new variables, but I converted quality from an integer to a factor. This is a table of the number of wines at each quality level.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The histograms for residual sugar, chlorides and density seemed a little packed. As such, I created boxplots for residual sugar and chlorides, as these had distant outliers, while I adjusted the x-axis limits for the density distribution. pH has a nearly-normal distribution, with a median of 3.18 and a mean of 3.188. The rest of the variables are generally right-skewed, with most values bunched at the low end of their range.
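One way to put a number on skew using only the printed summaries is the quartile (Bowley) skewness, (Q3 + Q1 - 2 x median) / (Q3 - Q1). The sketch below is an illustration added here, not part of the original analysis; the quartiles are copied from the summaries above and the function name is an assumption.

```python
def quartile_skewness(q1, median, q3):
    """Bowley skewness: 0 for a symmetric middle half, positive if right-skewed."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


# (Q1, median, Q3) copied from the summaries above.
quartiles = {
    "fixed.acidity": (6.3, 6.8, 7.3),
    "volatile.acidity": (0.21, 0.26, 0.32),
    "residual.sugar": (1.7, 5.2, 9.9),
    "chlorides": (0.036, 0.043, 0.050),
    "pH": (3.09, 3.18, 3.28),
}

for name, (q1, median, q3) in quartiles.items():
    print(f"{name}: {quartile_skewness(q1, median, q3):+.3f}")
```

This measure only looks at the middle half of the data. Chlorides, for instance, scores close to zero even though its maximum is far above the third quartile, so the skew described above for that variable comes from the tail, which is what the boxplots are for.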

Bivariate Plots Section

First, we’ll look at boxplots of each variable broken down by quality. Wines of Quality 9 have been excluded because there are only five of them in the entire dataset, and so small a sample can lead to very misleading impressions.

Fixed Acidity v Quality

Fixed acidity has the greatest variance among wines of Quality 3, but the distribution is more or less the same across the qualities.

Volatile Acidity v Quality

Volatile Acidity shows about the same variance across the qualities, and the number of outliers is noticeable.

Citric Acid v Quality

This distribution is one of the most compact in the dataset - the inter-quartile range is just 0.12, and the median lines nearly match each other across the different qualities of wine.

Residual Sugar v Quality

Residual sugar has some of the fewest outliers across the variables, with one of the larger inter-quartile ranges.

Chlorides v Quality

By contrast, chlorides shows an enormous number of outliers, with an interquartile range of just 0.014.
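The interquartile ranges quoted in this section can be recomputed from the whole-sample quartiles in the univariate summaries. The short sketch below does that for three of the variables; note that these are ranges for the full sample, not for each quality level separately.

```python
# (Q1, Q3) for the whole sample, copied from the univariate summaries.
quartiles = {
    "citric.acid": (0.27, 0.39),
    "residual.sugar": (1.7, 9.9),
    "chlorides": (0.036, 0.050),
}

# Interquartile range: the width of the middle half of the data.
iqr = {name: q3 - q1 for name, (q1, q3) in quartiles.items()}

for name, width in sorted(iqr.items(), key=lambda item: item[1]):
    print(f"{name}: IQR = {width:.3f}")
```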

Free Sulfur Dioxide v Quality

A compact distribution across the qualities.

Total Sulfur Dioxide v Quality

A distribution with a greater variance than for free sulfur dioxide.

Sulphates v Quality

The median lines nearly match each other across the different qualities of wine.

Density v Quality

Quite compact, with the exception of one large outlier in wines of quality 6.

pH v Quality

Varied and relatively even distributions across the qualities.

Alcohol v Quality

This is the most informative of all the boxplots. It’s clear from the plot that, in this dataset, wines of higher quality tend to have a higher alcohol content.

Residual Sugar v Alcohol

A fairly scattered graph (the Pearson’s Coefficient for residual sugar and alcohol is -0.451, a moderate negative correlation), but interesting for one outlier - there is one wine in the dataset that is far sweeter than the rest. These are the details of that wine:

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2782 2782           7.8            0.965         0.6           65.8
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 2782     0.074                   8                  160 1.03898 3.39
##      sulphates alcohol quality
## 2782      0.69    11.7       6

It’s interesting to note that this is the same wine that showed as an outlier in the residual.sugar and density boxplots. Could there be a relationship between density and residual sugar?

Density v Residual Sugar

Yes, there is. The Pearson’s Coefficient of Density v Residual Sugar in this dataset is 0.839. We can conclude that there is a strong positive correlation between density and residual sugar in the dataset.
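For readers who want to see how such a coefficient is calculated, here is a minimal sketch of the Pearson formula. It uses only the first five rows of residual sugar and density as printed (and rounded) in the structure output later in this report, so it illustrates the calculation and does not reproduce the 0.839 figure, which comes from all 4898 wines. The function name `pearson` is an assumption made for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


# First five rows only, as shown (rounded) in the structure output.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5]
density = [1.001, 0.994, 0.995, 0.996, 0.996]

r = pearson(residual_sugar, density)
print(f"r over the first five rows: {r:.3f}")
```

Even on this tiny subset the sign is positive, in line with the relationship described above.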

Density v Alcohol

There is a strong negative correlation between density and alcohol content. The Pearson’s Coefficient is -0.78.

Free v Total Sulfur Dioxide

There is a positive correlation between total sulfur dioxide and free sulfur dioxide, with a Pearson’s Coefficient of 0.616.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The small sample size (five) for wines of Quality 9 distorted the results. As such, these five were removed to generate the graphs in this part of the report. Of the features of interest, it’s clear that there is a strong relationship between alcohol and quality. There is no case to be made for residual sugar or free sulfur dioxide content having any effect on a wine’s quality. The other variables are distributed in more or less the same way across the different qualities of wine in the dataset.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a strong positive correlation between density and residual sugar (Pearson’s Coefficient: 0.839), a strong negative correlation between density and alcohol (Pearson’s Coefficient: -0.78), and a weaker positive correlation between total and free sulfur dioxide (Pearson’s Coefficient: 0.616).

What was the strongest relationship you found?

The strongest relationship I found was that between density and residual sugar, a relationship which has a Pearson’s Coefficient of 0.839.

Multivariate Plots Section

Density v Residual Sugar v Quality

The strong positive correlation between density and residual sugar is consistent through the different qualities of wine.

Density v Alcohol v Quality

The strong negative correlation between density and alcohol is also consistent through the different qualities of wine.

Free Sulfur Dioxide v Total Sulfur Dioxide v Quality

The correlation between free and total sulfur dioxide is not consistent across the different qualities of wine. It is most consistent for wines of quality 5 and 6, less so for the higher and lower quality wines.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

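Having discovered correlations between density and alcohol, density and residual sugar, and free and total sulfur dioxide, it made sense to look at these faceted by quality. The two density relationships repeated across the different qualities, while the relationship between free and total sulfur dioxide was less consistent outside qualities 5 and 6.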

Were there any interesting or surprising interactions between features?

The greatest surprises, or perhaps clearest results, were in the bivariate analysis section. These multivariate plots served chiefly to reinforce what had gone before.

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I did not create any model with the dataset. The only features of the dataset that could be modeled are the correlations between density and alcohol and between density and residual sugar. These relationships are only of interest to chemists, and chemists are probably already aware of them. They are of little interest to either the customer or the sommelier - the customer doesn’t want to study chemistry, and the sommelier knows that there’s far more to wine-making than statistical modelling.


Final Plots and Summary

Plot One: Alcohol v Quality

Description

This is the most informative plot in the report, clearly showing the relationship between alcohol content and wine quality. The six boxplots show the alcohol content dropping over wines of quality 3, 4 and 5 before rising steeply again in wines of quality 6, 7 and 8.

Plot Two: Alcohol v Density

Description

This graph plots density against alcohol for the sample data. The plot demonstrates a strong negative correlation between density and alcohol - they have a Pearson’s Coefficient of -0.78. The points on the scatter plot are set at alpha = 0.25 to reduce over-plotting. Some outliers have been removed to make the plot clearer.

Plot Three: Density v Residual Sugar

Description

Residual sugar is plotted against density in a scatterplot, demonstrating a strong positive correlation between them - they have a Pearson’s Coefficient of 0.839. The points are colored according to their quality to add further information to the graph, and are set at alpha = 0.25 to reduce over-plotting. Some outliers have been removed to make the plot clearer.


Reflection

The chief feature of my investigation was the relationship between quality and alcohol. A clear relationship was found using a boxplot of alcohol v quality – the higher the quality of wine in the dataset, the higher the alcohol content of that wine. The boxplot of alcohol distribution versus quality is the most informative plot in this report.

The strong positive correlation between density and residual sugar was an unexpected result of this investigation, discovered by investigating just one outlier value that repeated in both.

The other strong correlation in this dataset is between density and alcohol, which is not a factor when people are buying wine. While this is a disappointment to statisticians, it is almost certainly good news for sommeliers, who can be reassured that their specialty is indeed more art than science.

The distribution of qualities in the dataset, with many instances of medium-quality wines and relatively few instances of the lower and higher extremes, was unfortunate for an investigation centered on quality. Equal samples of all seven qualities would have been better suited to my own investigation.

The most obvious next step would be to look at the data for red wines, and compare these to this white wine dataset. However, the usefulness or otherwise of that comparison is dependent on what the researcher wants from the dataset.

There is a clear division between the attraction of this dataset to a chemist and to an oenophile. The strongest correlations in the dataset are of interest to chemists only, and of limited interest (or intelligibility) to the civilian population. An expansion of the data to red wines will be of interest to a chemist, but of little interest to non-chemists. While a vocation as a chemist doesn’t preclude someone being an oenophile, the true oenophile knows that the sommelier’s trade is always much more art than science.